# Autoregressive Moving Average Models

source("../code/init/chapter_start.R")
knitr::read_chunk('../code/chapter/03_arma.R')

In this chapter we introduce a class of time series models that is flexible and among the most commonly used to describe stationary time series. This class is represented by the AutoRegressive Moving Average (ARMA) models, which combine and include the autoregressive and moving average models seen in the previous chapter; we first discuss these two building blocks in further detail before introducing the general ARMA class. The importance of the latter class of models resides in its flexibility as well as its capacity to describe (or closely approximate) almost all the features of a stationary time series. The autoregressive parts of these models describe how consecutive observations in time influence each other, while the moving average parts capture possible unobserved shocks, thereby allowing these models to describe phenomena ranging from biology to finance.

To introduce and explain this class of models, this chapter is organized in the following manner. First of all, we discuss the class of linear processes of which the ARMA models are part. We then proceed to a detailed description of autoregressive models, in which we review their definition, explain their properties, introduce the main estimation methods for their parameters and highlight the diagnostic tools which can help determine whether the estimated models are appropriate or sufficient to describe the observed time series well. Once this is done, we use most of the results given for the autoregressive models to further describe and discuss the moving average models, for which we underline the property of invertibility, and finally the ARMA models. Indeed, the properties and estimation methods for the latter class are directly inherited from the discussions on the autoregressive and moving average models.

## Linear Processes

Before studying the ARMA models themselves, we introduce the broader class of linear processes to which they belong. A linear process expresses a time series as a (possibly infinite) linear combination of white noise terms, and this representation is important because it makes the second-order properties of the series, such as its mean and autocovariance function, straightforward to derive. As we will see, white noise, moving average and (causal) autoregressive processes can all be written in this form.

```{definition, label="lp", name="Linear Process"} A time series, $(X_t)$, is defined to be a linear process if it can be expressed as a linear combination of white noise by:

[{X_t} = \mu + \sum\limits_{j = - \infty }^\infty {{\psi j}{W{t - j}}} ]

where $W_t \sim WN(0, \sigma^2)$ and $\sum\limits_{j = - \infty }^\infty {\left| {{\psi _j}} \right|} < \infty$.

Note, the latter assumption is required to ensure that the series has a limit. 
Furthermore, the set of coefficients $\{ \psi_j \}_{j = -\infty, \cdots, \infty}$
can be viewed as a linear filter. These coefficients do not have to be all equal
nor symmetric, as later examples will show. Generally, the mean and autocovariance
function of a linear process are given by:

\[\begin{aligned}
\mu_{X} &= \mu \\
\gamma_{X}(h) &= \sigma _W^2\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{h + j}}}
\end{aligned}\]

The latter is derived from 

\[\begin{aligned}
  \gamma \left( h \right) &= Cov\left( {{X_t},{X_{t + h}}} \right) \\
   &= Cov\left( {\mu  + \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{W_{t - j}}} ,\mu  + \sum\limits_{k =  - \infty }^\infty  {{\psi _k}{W_{t + h - k}}} } \right) \\
   &= Cov\left( {\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{W_{t - j}}} ,\sum\limits_{k =  - \infty }^\infty  {{\psi _k}{W_{t + h - k}}} } \right) \\
   &= \sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{j + h}}Cov\left( {{W_{t - j}},{W_{t - j}}} \right)}  \\
   &= \sigma _W^2\sum\limits_{j =  - \infty }^\infty  {{\psi _j}{\psi _{j + h}}}
\end{aligned} \]

Within the above derivation, the key is to realize that 
$Cov\left( {{W_{t - j}},{W_{t + h - k}}} \right) = 0$ whenever $t - j \ne t + h - k$, so that the only non-zero terms in the double sum are those for which $k = j + h$.
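
To make the autocovariance formula above concrete, the following R sketch (an illustrative example not taken from the original text; the function name `lp_autocov` and the MA(1) coefficients are our own choices) computes $\gamma(h) = \sigma_W^2 \sum_j \psi_j \psi_{j+h}$ for a linear process with only finitely many non-zero coefficients:

```{r}
# A minimal sketch (assumed example): autocovariance of a linear process whose
# coefficients psi_0, psi_1, ..., psi_q are stored in the vector `psi`
# (all other psi_j are zero).
lp_autocov <- function(psi, sigma2 = 1, lag.max = 5) {
  sapply(0:lag.max, function(h) {
    n <- length(psi)
    if (h >= n) return(0)
    # only terms where both psi_j and psi_{j+h} are non-zero contribute
    sigma2 * sum(psi[1:(n - h)] * psi[(1 + h):n])
  })
}

# MA(1) with theta = 0.5: psi_0 = 1, psi_1 = 0.5, all other psi_j = 0
lp_autocov(psi = c(1, 0.5), sigma2 = 1, lag.max = 3)
# gives gamma(0) = 1.25, gamma(1) = 0.5 and 0 for higher lags
```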

Lastly, one of the convenient properties of a linear process is that it can 
be represented in a compact form under the **backshift operator**:

\[{X_t} = \psi \left( B \right){W_t}\]

```{example, label="lpwn", name="Linear Process of White Noise"}

The white noise process $\left\{X_t\right\}$, defined in \@ref(wn),
can be expressed as a linear process under:

\[\psi _j = \begin{cases}
      1 , &\mbox{ if } j = 0\\
      0 , &\mbox{ if } |j| \ge 1
\end{cases}.\]

and $\mu = 0$.

Therefore, $X_t = W_t$, where $W_t \sim WN(0, \sigma^2_W)$.
```

```{example, label="lpma1", name="Linear Process of Moving Average Order 1"}

Similarly, consider $\left\{X_t\right\}$ to be an MA(1) process, given by \@ref(ma1). The process can be expressed linearly under:

\[\psi _j = \begin{cases}
      1 , &\mbox{ if } j = 0\\
      \theta , &\mbox{ if } j = 1\\
      0 , &\mbox{ if } j \ge 2
\end{cases}.\]

and $\mu = 0$.

Thus, we have: $X_t = W_t + \theta W_{t-1}$.
```

```{example, label="lpsma", name="Linear Process and Symmetric Moving Average"}

Consider a symmetric moving average given by:

\[{X_t} = \frac{1}{{2q + 1}}\sum\limits_{j =  - q}^q {{W_{t + j}}} \]

Thus, $\left\{X_t\right\}$ is defined for $q + 1 \le t \le n-q$. The above process
would be a linear process since:

\[\psi _j = \begin{cases}
      \frac{1}{{2q + 1}} , &\mbox{ if } -q \le j \le q\\
      0 , &\mbox{ if } |j| > q
\end{cases}.\]

and $\mu = 0$.

In practice, if $q = 1$, we would have:

\[{X_t} = \frac{1}{3}\left( {{W_{t - 1}} + {W_t} + {W_{t + 1}}} \right)\]
```
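
As an illustration (not part of the original text), the symmetric moving average with $q = 1$ can be applied to a simulated white noise sequence using the base R `stats::filter()` function:

```{r}
# A minimal sketch (assumed example): 3-point symmetric moving average of white noise
set.seed(789)
Wt <- rnorm(100)

# sides = 2 applies the symmetric (two-sided) filter; the ends of the series
# are returned as NA since the filter is only defined for q + 1 <= t <= n - q
Xt <- stats::filter(Wt, filter = rep(1/3, 3), sides = 2)
head(cbind(Wt, Xt))
```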

```{example, label="lpar1", name="Linear Process of Autoregressive Process of Order 1"} A more intensive application of a linear process is $\left{X_t\right}$ as an AR1 process, defined in \@ref(ar1). The intensity comes from the necessity to define the weights with respect to the time lag.

[\psi _j = \begin{cases} \phi^j , &\mbox{ if } j \ge 0\ 0 , &\mbox{ if } j < 0 \end{cases}.]

and $\mu = 0$.

Under the condition that $\left| \phi \right| < 1$ the process can be considered to be the traditional $X_t = \phi X_{t-1} + W_t$. Furthermore, the process can also be considered to be an MA($\infty$)!
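
The following sketch (an assumed illustration; the value $\phi = 0.6$ and the truncation at 50 lags are arbitrary choices) compares the theoretical ACF of an AR(1) with that of its truncated MA($\infty$) representation using `ARMAacf()`:

```{r}
# A minimal sketch (assumed example): an AR(1) with phi = 0.6 is well approximated
# by a truncated MA(infinity) with weights psi_j = phi^j.
phi <- 0.6
psi <- phi^(0:50)   # truncated linear process weights psi_j = phi^j

# Theoretical ACF of the AR(1) ...
acf_ar1 <- ARMAacf(ar = phi, lag.max = 5)

# ... and of the truncated MA representation (ARMAacf takes the MA coefficients
# excluding psi_0 = 1)
acf_ma_trunc <- ARMAacf(ma = psi[-1], lag.max = 5)

round(rbind(acf_ar1, acf_ma_trunc), 4)   # the two rows essentially coincide
```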

## Linear Operators


## Autoregressive Models (AR(p))

The class of autoregressive models is based on the idea that previous values in the time series are needed to explain current values in the series. For this class of models, we assume that the $p$ previous observations are needed for this purpose and we therefore denote this class as AR($p$). In the previous chapter we introduced the AR(1) model, in which only the immediately previous observation is needed to explain the current one; it is therefore a particular case of the more general class of AR(p) models.

```{definition, label="ar_p","Autoregressive Models of Order P"}
The AR(p) models can be formally represented as follows
$${X_t} = {\phi_1}{Y_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t},$$
where $\phi_p \neq 0$ and $W_t$ is a (Gaussian) white noise process with variance $\sigma^2$. 

In general, we will assume that the expectation of the process $({X_t})$, as well as that of the following ones in this chapter, is zero. The reason for this simplification is that if $\mathbb{E} [ X_t ] = \mu$, we can define an AR process around $\mu$ as follows:

$$X_t - \mu = \sum_{i = 1}^p \phi_i \left( X_{t-i} - \mu \right) + W_t,$$

which is equivalent to

$$X_t = \mu^{\star} + \sum_{i = 1}^p \phi_i X_{t-i} + W_t,$$

where $\mu^{\star} = \mu (1 - \sum_{i = 1}^p \phi_i)$. Therefore, to simplify the notation we will generally consider only zero mean processes, since adding means (as well as other deterministic trends) is easy.
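
As a quick numerical check (an assumed illustration, not from the original text), we can simulate an AR(1) around a mean $\mu$ through the representation with $\mu^{\star} = \mu(1 - \phi)$ and verify that the sample mean of the simulated series is close to $\mu$:

```{r}
# A minimal sketch (assumed example): AR(1) with mean mu = 10 simulated via
# X_t = mu_star + phi * X_{t-1} + W_t, where mu_star = mu * (1 - phi)
set.seed(321)
phi     <- 0.5
mu      <- 10
mu_star <- mu * (1 - phi)

n  <- 10000
Xt <- numeric(n)
Xt[1] <- mu                                  # start the recursion at the mean
for (t in 2:n) {
  Xt[t] <- mu_star + phi * Xt[t - 1] + rnorm(1)
}

mean(Xt)   # close to mu = 10
```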

A useful way of representing AR(p) processes is through the backshift operator introduced in the previous section:

\[\begin{aligned}
  {X_t} &= {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t} \\
   &= {\phi_1}B{X_t} + ... + {\phi_p}B^p{X_t} + {W_t} \\
   &= ({\phi_1}B + ... + {\phi_p}B^p){X_t} + {W_t},
\end{aligned}\]

which finally yields

$$(1 - {\phi _1}B - ... - {\phi_p}B^p){X_t} = {W_t},$$

which, in abbreviated form, can be expressed as

$$\phi(B){X_t} = W_t.$$

We will see that $\phi(B)$ is important to establish the stationarity of these processes and is called the autoregressive operator. Moreover, this quantity is closely related to another important property of AR processes called causality. Before formally defining this new property, we consider the following example, which provides an intuitive illustration of its importance.

Example: Consider a classical AR(1) model with $|\phi| > 1$. Such a model could be expressed as

$$X_t = \phi^{-1} X_{t+1} - \phi^{-1} W_{t+1} = \phi^{-k} X_{t+k} - \sum_{i = 1}^{k} \phi^{-i} W_{t+i}.$$

Since $|\phi| > 1$, the term $\phi^{-k} X_{t+k}$ vanishes as $k \to \infty$ and we obtain

$$X_t = - \sum_{j = 1}^{\infty} \phi^{-j} W_{t+j},$$

which is a linear process and is therefore stationary. Unfortunately, such a model is of little practical use because we would need the future to predict the present; such processes are called non-causal.

### Properties of AR(p) models

In this section we describe the main property of AR(p) models, namely causality, which was informally introduced in the previous paragraphs and which we now define in a more formal manner.

Definition: An AR(p) model is causal if the time series $\{ X_t \}_{t=-\infty}^{\infty}$ can be written as a one-sided linear process:
\begin{equation}
X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j} = \frac{1}{\phi(B)} W_t = \psi(B) W_t,
(\#eq:causal)
\end{equation}
where $\psi(B) = \sum_{j = 0}^{\infty} \psi_j B^j$, $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and we set $\psi_0 = 1$.

As discussed earlier, this condition implies that only the past values of the time series can explain its future values, and not vice versa. However, it is generally not straightforward to verify the causality of an AR(p) process directly from the above definition, and thus the following property is useful in practice.

Property (Causality): If an AR(p) model is causal, then the coefficients of the one-sided linear process given in (\@ref(eq:causal)) can be obtained by solving
\begin{equation}
\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{1}{\phi(z)}, \mbox{ } |z| \leq 1.
\end{equation}

It can be seen that there is no solution to the above equation if $\phi(z) = 0$ for some $|z| \leq 1$, and therefore an AR(p) model is causal if and only if $\phi(z) \neq 0$ for $|z| \leq 1$. Equivalently, this condition is respected when the roots of $\phi(z) = 0$ lie outside the unit circle.
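
In practice, this condition can be checked numerically. The sketch below (an assumed example with arbitrarily chosen coefficients, not taken from the original text) verifies the causality of an AR(2) model by computing the roots of $\phi(z)$ with `polyroot()` and checking that their moduli exceed one:

```{r}
# A minimal sketch (assumed example): causality check for the AR(2) model
# X_t = 0.75 X_{t-1} - 0.5 X_{t-2} + W_t, i.e. phi(z) = 1 - 0.75 z + 0.5 z^2
phi <- c(0.75, -0.5)

# polyroot() takes the polynomial coefficients in increasing order of the power of z
roots <- polyroot(c(1, -phi))

# The model is causal if all roots of phi(z) lie outside the unit circle
Mod(roots)             # here both moduli equal sqrt(2) > 1
all(Mod(roots) > 1)    # TRUE: the AR(2) is causal
```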

### Estimation of AR(p) models

Given the previously defined properties of the AR(p) models, we will now discuss how these models can be estimated, more specifically how the $p+1$ parameters can be obtained from an observed time series. Indeed, a reliable estimation of these models is necessary in order to interpret and describe different natural phenomena and/or forecast possible future values of the time series.

A first approach builds upon the definition of the AR(p) model given earlier. Recall that

\begin{equation}
X_t = \sum_{j = 1}^{p} \phi_j X_{t-j} + W_t,
\end{equation}

which delivers the following autocovariance function

\begin{equation}
\gamma(h) = cov(X_{t+h}, X_t) = cov\left(\sum_{j = 1}^{p} \phi_j X_{t+h-j} + W_{t+h}, X_t\right) = \sum_{j = 1}^{p} \phi_j \gamma(h-j), \mbox{ } h \geq 1.
\end{equation}

Rearranging the above expression we obtain the following general equations

\begin{equation}
\gamma(h) - \sum_{j = 1}^{p} \phi_j \gamma(h-j) = 0, \mbox{ } h \geq 1,
\end{equation}

and, for $h = 0$, recalling that $\gamma(h) = \gamma(-h)$,

\begin{equation}
\gamma(0) - \sum_{j = 1}^{p} \phi_j \gamma(j) = \sigma_w^2.
\end{equation}

We can now define the Yule-Walker equations.

Definition: The Yule-Walker equations are given by

\begin{equation}
\gamma(h) = \phi_1 \gamma(h-1) + ... + \phi_p \gamma(h-p), \mbox{ } h = 1,...,p,
\end{equation}

and

\begin{equation}
\sigma_w^2 = \gamma(0) - \phi_1 \gamma(1) - ... - \phi_p \gamma(p),
\end{equation}

which in matrix notation can be expressed as follows

\begin{equation}
\Gamma_p \mathbf{\phi} = \mathbf{\gamma}_p \quad \text{and} \quad \sigma_w^2 = \gamma(0) - \mathbf{\phi}'\mathbf{\gamma}_p,
\end{equation}

where $\Gamma_p$ is the $p\times p$ matrix containing the autocovariances $\gamma(k-j), \; j,k = 1, ...,p$, while $\mathbf{\phi} = (\phi_1,...,\phi_p)'$ and $\mathbf{\gamma}_p = (\gamma(1),...,\gamma(p))'$ are $p\times 1$ vectors.

Considering the Yule-Walker equations, it is possible to use a method of moments approach and simply replace the theoretical quantities given in the previous definition with their empirical (estimated) counterparts that we saw in the previous chapter. This gives us the following Yule-Walker estimators:

\begin{equation}
\hat{\mathbf{\phi}} = \hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p \quad \text{and} \quad \hat{\sigma}_w^2 = \hat{\gamma}(0) - \hat{\mathbf{\gamma}}_p'\hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p.
\end{equation}
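
The following R sketch (an assumed example, not from the original text) illustrates the Yule-Walker estimators on a simulated AR(2) series, both through the built-in `ar.yw()` function and by directly plugging the empirical autocovariances into the formulas above:

```{r}
# A minimal sketch (assumed example): Yule-Walker estimation of an AR(2) model
set.seed(123)
Xt <- arima.sim(n = 500, model = list(ar = c(0.5, -0.25)))

# Built-in Yule-Walker solver
fit_yw <- ar.yw(Xt, order.max = 2, aic = FALSE)
fit_yw$ar          # estimates of phi_1 and phi_2
fit_yw$var.pred    # estimate of sigma_w^2

# "By hand": replace theoretical autocovariances with their empirical counterparts
gamma_hat <- acf(Xt, lag.max = 2, type = "covariance", plot = FALSE)$acf[, , 1]
Gamma_p    <- toeplitz(gamma_hat[1:2])       # p x p matrix of gamma(k - j)
gamma_p    <- gamma_hat[2:3]                 # vector (gamma(1), gamma(2))'
phi_hat    <- solve(Gamma_p, gamma_p)        # Yule-Walker estimate of phi
sigma2_hat <- gamma_hat[1] - sum(phi_hat * gamma_p)
phi_hat; sigma2_hat
```

Note that `ar.yw()` applies small finite-sample adjustments (such as demeaning the series), so the two sets of estimates agree closely but not exactly.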

These estimators have the following asymptotic properties.

Property (Consistency and Asymptotic Normality of Yule-Walker estimators): The Yule-Walker estimators for a causal AR(p) model have the following asymptotic properties:

\begin{equation}
\sqrt{T}(\hat{\mathbf{\phi}}- \mathbf{\phi}) \xrightarrow{\mathcal{D}} \mathcal{N}(\mathbf{0},\sigma_w^2\Gamma_p^{-1}) \quad \text{and} \quad \hat{\sigma}_w^2 \xrightarrow{\mathcal{P}} \sigma_w^2 .
\end{equation}

Therefore the Yule-Walker estimators have an asymptotically normal distribution and the estimator of the innovation variance is consistent. Moreover, these estimators are asymptotically optimal for AR(p) models, meaning that they are also efficient. However, there exists another method which allows one to achieve this efficiency also for general ARMA models: the maximum likelihood method.

Considering an AR(1) model as an example, and assuming without loss of generality that the expectation is zero, we have the following representation of the AR(1) model

\begin{equation}
X_t = \phi X_{t-1} + w_t,
\end{equation}

where $|\phi|<1$ and $w_t \overset{iid}{\sim} \mathcal{N}(0,\sigma_w^2)$. Supposing we have observations $(x_t)_{t=1,...,T}$ issued from this model, then the likelihood function for this setting is given by

\begin{equation}
L(\phi,\sigma_w^2) = f(x_1,...,x_T|\phi,\sigma_w^2),
\end{equation}

which, for an AR(1) model, can be rewritten as follows

\begin{equation}
L(\phi,\sigma_w^2) = f(x_1)f(x_2|x_1)\cdots f(x_T|x_{T-1}).
\end{equation}

If we define $\Omega_t^p$ as the information contained in the $p$ observations previous to time $t$, the above can be generalized for an AR(p) model as follows

\begin{equation}
L(\mathbf{\phi},\sigma_w^2) = f(x_1,...,x_p)f(x_{p+1}|\Omega_{p+1}^p)\cdots f(x_T|\Omega_{T}^p),
\end{equation}

where $f(x_1,...,x_p)$ is the joint probability distribution of the first $p$ observations. A discussion on how to find $f(x_1,...,x_p)$ will be presented in the following paragraphs based on the approach used to find $f(x_1)$ in the AR(1) likelihood. Going back to the latter, we know that $x_t|x_{t-1} \sim \mathcal{N}(\phi x_{t-1},\sigma_w^2)$ and therefore we have that

\begin{equation}
f(x_t|x_{t-1}) = f_w(x_t - \phi x_{t-1}),
\end{equation}

where $f_w(\cdot)$ is the distribution of $w_t$. This allows us to rewrite the likelihood function as follows

\begin{equation}
L(\phi,\sigma_w^2) = f(x_1)\prod_{t=2}^T f_w(x_t - \phi x_{t-1}),
\end{equation}

where $f(x_1)$ can be found through the causal representation

\begin{equation}
x_1 = \sum_{j=0}^{\infty} \phi^j w_{1-j},
\end{equation}

which implies that $x_1$ follows a normal distribution with zero expectation and a variance given by $\frac{\sigma_w^2}{(1-\phi^2)}$. Based on this, the likelihood function of an AR(1) finally becomes

\begin{equation}
L(\phi,\sigma_w^2) = (2\pi \sigma_w^2)^{-\frac{T}{2}} (1 - \phi^2)^{\frac{1}{2}} \exp \left(-\frac{S(\phi)}{2 \sigma_w^2}\right),
\end{equation}

with $S(\phi) = (1-\phi^2) x_1^2 + \sum_{t=2}^T (x_t -\phi x_{t-1})^2$. Taking the logarithm of this likelihood, the minimization of its negative is usually carried out numerically. However, if we condition on the initial values, the AR(p) models are linear and, for example, we can then define the conditional likelihood of an AR(1) as

\begin{equation}
L(\phi,\sigma_w^2|x_1) = (2\pi \sigma_w^2)^{-\frac{T-1}{2}} \exp \left(-\frac{S_c(\phi)}{2 \sigma_w^2}\right),
\end{equation}

where

\begin{equation}
S_c(\phi) = \sum_{t=2}^T (x_t -\phi x_{t-1})^2 .
\end{equation}

The latter is called the conditional sum of squares and $\phi$ can be estimated by solving a straightforward linear regression problem. Once an estimate $\hat{\phi}$ is obtained, this can be used to obtain the conditional maximum likelihood estimate of $\sigma_w^2$:

\begin{equation}
\hat{\sigma}_w^2 = \frac{S_c(\hat{\phi})}{(T-1)} .
\end{equation}
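
As an illustration of the above (an assumed example, not from the original text), the conditional sum of squares estimator of an AR(1) reduces to a linear regression of $x_t$ on $x_{t-1}$, which we can compare with the exact maximum likelihood estimate returned by `arima()`:

```{r}
# A minimal sketch (assumed example): conditional least squares vs full ML for an AR(1)
set.seed(456)
xt <- arima.sim(n = 300, model = list(ar = 0.7))

# Conditional sum of squares: conditioning on x_1 turns the problem into a
# linear regression of x_t on x_{t-1} (no intercept since the mean is zero)
fit_css <- lm(xt[-1] ~ xt[-length(xt)] - 1)
phi_hat    <- coef(fit_css)                              # conditional MLE of phi
sigma2_hat <- sum(resid(fit_css)^2) / (length(xt) - 1)   # S_c(phi_hat)/(T - 1)

# Full (exact) maximum likelihood, which also uses the density of x_1
fit_ml <- arima(xt, order = c(1, 0, 0), include.mean = FALSE, method = "ML")
c(phi_css = unname(phi_hat), phi_ml = unname(coef(fit_ml)))
```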

The estimation methods presented so far are the standard ones for this kind of model. Nevertheless, if the data suffer from some form of contamination, these methods can become highly biased. For this reason, some robust estimators are available to limit this problem if there are indeed outliers in the observed time series. A first solution is given by the estimator proposed in Kunsch (1984), who underlines that the MLE score function of an AR(p) is given by

\begin{equation}
\kappa(\mathbf{\theta}|x_j,...,x_{j+p}) = \frac{\partial}{\partial \mathbf{\theta}} \left(x_{j+p} - \sum_{k=1}^p \phi_k x_{j+p-k}\right)^2,
\end{equation}

which delivers the estimating equation

\begin{equation}
\sum_{j=1}^{n-p} \kappa (\mathbf{\theta}|x_j,...,x_{j+p}) = 0 .
\end{equation}

The score function


